NHGH analysis

Katrine Meldgård, Margrethe Bøe Lysø, Kristine Rosted Petersen, Enrico Leonardi and Pernille Jensen

Introduction

NHANES glycohemoglobin data

  • National Health and Nutrition Examination Survey

Diabetes Mellitus (DM)

  • Type 1 Diabetes: Insufficient production of insulin.

  • Type 2 Diabetes: Inefficient utilization of insulin.

  • About 422 million people affected worldwide; 1.5 million deaths each year

Aim

  • Correlation between biomarkers/measurements and diabetes
  • Whether biomarker values return towards normal levels after medication
  • How income class relates to diabetes diagnosis and medication

Methods

  • Raw data:
    6795 observations with 20 variables.
  • 01_load_data:
    Data loaded and split into metadata and measurements.
  • 02_clean_data (Data Wrangling):
    Age values converted to integers, column names replaced with more meaningful names, unusable values set to NA, and the data sets re-joined into one.
  • 03_augment:
    Sex encoded as 0/1 (female = 0, male = 1), waist and height converted from cm to m, and body fat percentage, conicity index and a ‘dm_status’ column added.
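# 03_augment.qmd
# Add derived variables: body fat percentage (bfp), conicity index (ci)
# and a combined diagnosis/medication status (dm_status).
library(dplyr)

diabetes_data <- diabetes_data |>
  mutate(
    # Encode sex numerically for the bfp formula (male = 1, female = 0)
    bin_sex = case_when(
      sex == "male" ~ 1,
      sex == "female" ~ 0
    ),
    # Body fat percentage estimated from BMI, age and sex
    bfp = 1.39 * bmi + 0.16 * age - 10.34 * bin_sex - 9,
    # Convert waist and height from cm to m
    waist_m = waist / 100,
    height_m = height / 100,
    # Conicity index: waist (m) / (0.109 * sqrt(weight / height (m)))
    ci = waist_m / (0.109 * sqrt(weight / height_m)),
    # 1 = not diagnosed, 2 = diagnosed without medication,
    # 3 = diagnosed with medication
    dm_status = case_when(
      diagnosis == 0 ~ 1,
      diagnosis == 1 & medication == 0 ~ 2,
      diagnosis == 1 & medication == 1 ~ 3
    )
  )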
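
To make the two formulas in the augment step concrete, here is a small worked example for a single, made-up individual. The helper names `body_fat_pct` and `conicity_index` and the input values are hypothetical and not part of the project code; weight is assumed to be in kg, and waist and height in cm.

```r
# Hypothetical helpers wrapping the formulas used in 03_augment.qmd.
body_fat_pct <- function(bmi, age, bin_sex) {
  # bin_sex: 1 for male, 0 for female
  1.39 * bmi + 0.16 * age - 10.34 * bin_sex - 9
}

conicity_index <- function(waist_cm, height_cm, weight_kg) {
  waist_m <- waist_cm / 100
  height_m <- height_cm / 100
  waist_m / (0.109 * sqrt(weight_kg / height_m))
}

# Made-up example: 40-year-old male, BMI 25, waist 90 cm, height 180 cm, 81 kg
bfp_example <- body_fat_pct(bmi = 25, age = 40, bin_sex = 1)
ci_example <- conicity_index(waist_cm = 90, height_cm = 180, weight_kg = 81)

print(bfp_example)  # 21.81
print(round(ci_example, 3))  # 1.231
```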

Descriptive analysis

Observations: 6795
Variables (augmented): 26
Diagnosed: 914
Medicated: 607
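
As a rough illustration of how counts like these follow from the `dm_status` coding (1 = not diagnosed, 2 = diagnosed without medication, 3 = diagnosed with medication), the sketch below tabulates a small made-up vector. The values are invented for illustration and are not the NHANES data.

```r
# Made-up status codes for ten individuals (not real data).
dm_status <- c(1, 1, 1, 2, 3, 3, 1, 2, 3, 1)

# Diagnosed individuals have status 2 or 3; medicated individuals have status 3.
n_observations <- length(dm_status)
n_diagnosed <- sum(dm_status %in% c(2, 3))
n_medicated <- sum(dm_status == 3)

# Full breakdown per status code
status_counts <- table(dm_status)
print(status_counts)
```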

Data Visualization: Diabetes vs Income & Age

Biomarkers and diagnosis/medication status

Physical attributes and disease/medication status

PCA Analysis

  • Data
    • Non-medicated individuals
    • No observations with NA
    • Only anthropometric and biomarker measurements
  • Classes not separated
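
A minimal sketch of a PCA set up along these lines, using base R's `prcomp` on simulated data. The column names, the simulated values and the choice to centre and scale are assumptions for illustration; the project's actual code may differ.

```r
# Simulated stand-in for the anthropometric and biomarker columns
# (names and values are made up for illustration).
set.seed(1)
n <- 200
toy <- data.frame(
  weight = rnorm(n, 80, 15),
  waist = rnorm(n, 95, 12),
  creatinine = rnorm(n, 0.9, 0.2),
  glycohemoglobin = rnorm(n, 5.6, 0.8),
  diagnosis = rbinom(n, 1, 0.1)
)
toy$waist[c(3, 17)] <- NA  # inject two missing values

# Keep only observations without NA
complete <- toy[complete.cases(toy), ]

# Run the PCA on the measurement columns only, centred and scaled
measurements <- complete[, setdiff(names(complete), "diagnosis")]
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores on the first two components, labelled by class for plotting
scores <- data.frame(pca$x[, 1:2], diagnosis = factor(complete$diagnosis))
```

Plotting `PC1` against `PC2` from `scores`, coloured by `diagnosis`, is one way to check visually whether the classes separate.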

Logistic regression model

  • Backwards selection:
    • Weight
    • Leg
    • Waist
    • Creatinine
    • Glycohemoglobin
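
Backward selection for a logistic regression can be sketched with base R's `glm` and `step`. This is an illustration on simulated data, assuming AIC-based elimination (the default criterion of `step`); the slides do not state which criterion was used, and the `noise` column is a made-up extra predictor.

```r
# Simulated data; column names mirror the slide, values are made up.
set.seed(42)
n <- 500
toy <- data.frame(
  weight = rnorm(n, 80, 15),
  leg = rnorm(n, 40, 4),
  waist = rnorm(n, 95, 12),
  creatinine = rnorm(n, 0.9, 0.2),
  glycohemoglobin = rnorm(n, 5.6, 0.8),
  noise = rnorm(n)  # hypothetical predictor with no true effect
)
# Outcome depends on waist and glycohemoglobin only (by construction)
lin <- -2 + 0.04 * (toy$waist - 95) + 1.2 * (toy$glycohemoglobin - 5.6)
toy$diagnosis <- rbinom(n, 1, plogis(lin))

# Start from the full model and drop terms while AIC improves
full <- glm(diagnosis ~ ., data = toy, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)

# Predictors that survive the elimination
kept <- attr(terms(reduced), "term.labels")
print(kept)
```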

Classification

  • Model based on parameters found by LR
  • Data
    • Non-medicated individuals
    • No observations with NA
  • Nearly all observations predicted as 0 (non-diabetic)
  • AUC = 0.7750374
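
The combination of a moderate AUC and almost no predicted positives is typical for imbalanced classes. The sketch below reproduces the pattern on simulated data: it fits a logistic model, builds a confusion matrix at an assumed 0.5 threshold, and computes the AUC with a rank-based (Mann-Whitney) formula. The helper `auc_rank`, the threshold and the data are assumptions for illustration, not the project's code.

```r
# Rank-based AUC: probability that a random positive scores above
# a random negative (ties get average ranks).
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Simulated imbalanced data: positives are a small minority
set.seed(7)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")

# Confusion matrix at a 0.5 threshold (an assumed cut-off)
pred <- factor(as.integer(p > 0.5), levels = c(0, 1))
confusion <- table(predicted = pred, observed = y)
auc <- auc_rank(p, y)

print(confusion)
print(auc)
```

Because the AUC depends only on how the predicted probabilities rank the observations, it can be well above 0.5 even when few probabilities exceed the threshold.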

Discussion

  • Income classes and medication.
  • Medication effect.
  • Relation between anthropometric and biomarker measurements.
  • Classification and PCA: No clear relation.
  • Confusion matrix: Few diabetic cases predicted as diabetic.
  • Uneven distribution between diabetic and non-diabetic individuals.
    • Improvement: Include a larger group of diabetic individuals.

Conclusion

  • Medication showed effect.
  • Modelling: No conclusion.